
Conversation

@nyaapa (Contributor) commented Nov 18, 2025

What changes were proposed in this pull request?

  • Group multiple keys into one Arrow batch; this generally produces far fewer batches when key cardinality is high (see the sketch after this list).
  • Do not group init_data and input_data in batch0: serialize init_data first, then input_data. In the worst case this adds one extra chunk, but it makes the Python-side logic much simpler.
  • Do not create extra dataframes when they are not needed; copy the empty one instead.
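
A minimal sketch of the first bullet, assuming pyarrow and an illustrative schema and record limit; this is not the PR's serializer code, just the shape of the idea:

import pyarrow as pa

schema = pa.schema([("key", pa.string()), ("value", pa.int64())])

def batched(rows, max_records_per_batch=10_000):
    # Accumulate rows across key boundaries and emit one Arrow batch per chunk,
    # instead of one batch per key.
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= max_records_per_batch:
            yield pa.RecordBatch.from_pylist(buffer, schema=schema)
            buffer = []
    if buffer:
        yield pa.RecordBatch.from_pylist(buffer, schema=schema)

# Three distinct keys end up in a single batch rather than three:
rows = [{"key": "a", "value": 1}, {"key": "b", "value": 2}, {"key": "c", "value": 3}]
print(sum(1 for _ in batched(rows)))  # 1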

Why are the changes needed?

Benchmark results show that in high-cardinality scenarios this optimization improves batch0 time by ~40%. No visible regressions in the low-cardinality case.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UT and Benchmark:

10,000,000 distinct keys in init state (8x i3.4xlarge):
- Without Optimization: 11400 records/s
- With Optimization: 30000 records/s

Was this patch authored or co-authored using generative AI tooling?

No

@holdenk (Contributor) left a comment


Just started taking a look; it's been a hot minute since I looked at the Arrow serialization logic, so some perhaps silly questions.


# Check if the entire column is null
if data_column.null_count == len(data_column):
    return None
Contributor

Given we've changed the implicit type signature of the function, let's maybe add a type annotation on generate_data_batches for readability.
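
For illustration only (the actual generate_data_batches signature in pyspark's serializers is Spark-internal and may differ), the kind of annotation being asked for could look like:

from typing import Any, Iterator, Tuple
import pandas as pd
import pyarrow as pa

def generate_data_batches(
    batches: Iterator[pa.RecordBatch],
) -> Iterator[Tuple[Any, pd.DataFrame]]:
    # Hypothetical: group Arrow batches by grouping key and yield (key, data) pairs.
    ...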

Contributor Author

Done


parsed_offsets = extract_key_value_indexes(arg_offsets)

import pandas as pd
Contributor

Random place for an import
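
The conventional fix is to hoist the import to module scope rather than importing pandas mid-function; a toy sketch, not the PR's actual code:

import pandas as pd  # at the top of the module, next to the other imports

def empty_frame(columns):
    # Illustrative helper using the module-level import.
    return pd.DataFrame(columns=list(columns))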

init_rows.append(init_row)

total_len = len(input_rows) + len(init_rows)
if total_len >= self.arrow_max_records_per_batch:
Contributor

The SQLConf config param says that if it is set to zero or a negative number there is no limit, but in this code a zero or negative value means we always output a fresh batch per row. Let's change the behaviour and add a test covering this.
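
A self-contained sketch of the requested behaviour, with illustrative names (assuming the limit mirrors a maxRecordsPerBatch-style config where zero or negative means unlimited):

def should_flush(total_len, max_records_per_batch):
    # A positive limit triggers a flush once reached; zero or negative means "no limit".
    return max_records_per_batch > 0 and total_len >= max_records_per_batch

assert should_flush(10, 5) is True
assert should_flush(10, 0) is False   # 0: unlimited, never flush early
assert should_flush(10, -1) is False  # negative: unlimited as well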

@nyaapa (Contributor Author) Nov 20, 2025

oh, right;
copied that from non-init state handling; 🫠
nice catch!

Contributor Author

Done

Comment on lines 1994 to 2005
def row_stream():
    for batch in batches:
        if self.arrow_max_bytes_per_batch != 2**31 - 1 and batch.num_rows > 0:
            batch_bytes = sum(
                buf.size
                for col in batch.columns
                for buf in col.buffers()
                if buf is not None
            )
            self.total_bytes += batch_bytes
            self.total_rows += batch.num_rows
            self.average_arrow_row_size = self.total_bytes / self.total_rows
Contributor

This logic seems to be duplicated from elsewhere in the file; maybe we can add it to a base class?
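
One possible shape for such a shared helper, with illustrative names rather than the actual pyspark class hierarchy:

import pyarrow as pa

class ArrowBatchSizeTracker:
    # Hypothetical helper holding the running batch-size statistics
    # that several serializers currently duplicate.

    def __init__(self):
        self.total_bytes = 0
        self.total_rows = 0
        self.average_arrow_row_size = 0.0

    def track_batch(self, batch: pa.RecordBatch) -> None:
        # Accumulate Arrow buffer sizes and keep a running average row size.
        if batch.num_rows == 0:
            return
        self.total_bytes += sum(
            buf.size
            for col in batch.columns
            for buf in col.buffers()
            if buf is not None
        )
        self.total_rows += batch.num_rows
        self.average_arrow_row_size = self.total_bytes / self.total_rows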

Contributor Author

Done
